# Multi-frame Analysis
Cogvlm2 Llama3 Caption
Other
CogVLM2-Caption is a video caption generation model used to generate training data for the CogVideoX model.
Video-to-Text
Transformers, English

THUDM
7,493
95
Vivit B 16x2 Kinetics400
MIT
ViViT is an extension of the Vision Transformer (ViT) for video processing, particularly suitable for video classification tasks.
Video Processing
Transformers

google
56.94k
32
Featured Recommended AI Models